Gradient Descent#

Problem: minimize \(f(x)\) for \(x \in \mathbb{R}^d\)

Idea: the gradient of a function

\[\nabla f(x) = \left[ \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_d} \right]\]

at a point gives the direction of steepest ascent of the function at that point. So if we move in the direction opposite to the gradient, we should move towards a local minimum.

\[x_{i+1} = x_i - \alpha \nabla f(x_i)\]

where \(\alpha\) is the step size (also called the learning rate), \(x_i\) is the current point, \(x_{i+1}\) is the next point, and \(x_0\) is the initial point.

We repeat this process until the gradient norm \(\|\nabla f(x_i)\|\) is close to \(0\) for some \(i\).
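
For example, take \(f(x) = x^2\), for which \(\nabla f(x) = 2x\). Starting from \(x_0 = 1\) with \(\alpha = 0.1\), each step gives

\[ x_{i+1} = x_i - 0.1 \cdot 2 x_i = 0.8\, x_i, \]

so \(x_i = 0.8^i\) and the iterates shrink geometrically towards the minimizer \(x^\ast = 0\).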

Note:

  • This algorithm converges to a nearby local minimum, which depends on the initial point \(x_0\); it is not guaranteed to find the global minimum.

  • If \(\nabla f(x_i)=0\), then we are at a stationary point (not necessarily a minimum), so the iteration stops moving.

  • In the simplest form, \(\alpha\) is fixed. In more advanced versions it is adaptive, follows a schedule, or is randomized. A minimal implementation of the fixed-step version follows below.
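
As a minimal sketch of this procedure in code (assuming a fixed step size and a hand-coded gradient; `grad_f`, `tol`, and `max_iter` are illustrative names, not from any particular library):

import numpy as np

def gradient_descent(grad_f, x0, alpha=0.1, tol=1e-6, max_iter=1000):
    """Iterate x_{i+1} = x_i - alpha * grad_f(x_i) until the gradient is small."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:  # stopping criterion: gradient close to 0
            break
        x = x - alpha * g
    return x

# Example: f(x) = (x_1 - 3)^2 + (x_2 + 1)^2 has its minimum at (3, -1)
grad_f = lambda v: 2 * np.array([v[0] - 3.0, v[1] + 1.0])
print(gradient_descent(grad_f, x0=[0.0, 0.0]))  # approximately [ 3. -1.]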

Interactive Visualization#

  • 1D visualization

  • 2D visualization

Example#

Use gradient descent to solve a linear regression problem.

For a single-variable linear regression model \(f(x) = wx + b\) and a dataset \(\{ (x_i, y_i) \}\), the Mean Squared Error (MSE) loss function is defined as:

\[ L(w, b) = \frac{1}{N} \sum_{i=1}^{N} (w x_i + b - y_i)^2 \]

where \(N\) is the number of data points.

Derivatives of the Loss Function#

The gradients of the loss function with respect to \(w\) and \(b\) are:

\[ \frac{\partial L}{\partial w} = \frac{2}{N} \sum_{i=1}^{N} (w x_i + b - y_i) x_i \]
\[ \frac{\partial L}{\partial b} = \frac{2}{N} \sum_{i=1}^{N} (w x_i + b - y_i) \]
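
Both gradients follow from the chain rule: for a single residual \(r_i = w x_i + b - y_i\),

\[ \frac{\partial}{\partial w} r_i^2 = 2 r_i \frac{\partial r_i}{\partial w} = 2 (w x_i + b - y_i) x_i, \qquad \frac{\partial}{\partial b} r_i^2 = 2 r_i \cdot 1, \]

and averaging over the \(N\) data points gives the expressions above.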

Using gradient descent, the parameters \(w\) and \(b\) are updated as:

\[ w_{k+1} = w_k - \alpha \frac{\partial L}{\partial w} \]
\[ b_{k+1} = b_k - \alpha \frac{\partial L}{\partial b} \]

where \(\alpha\) is the learning rate.

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.animation as animation
from IPython.display import HTML, display

# Step 1: Generate synthetic data
np.random.seed(42)
x = 2 * np.random.rand(100, 1)
y = 4 + 3 * x + np.random.randn(100, 1)

# for plotting the straight line
gridx = np.linspace(0, 2, 100).reshape(-1, 1)

# Step 2: Define the loss function
def compute_loss(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

# Create a grid of w and b values to visualize the loss landscape
w_values = np.linspace(0, 8, 200)
b_values = np.linspace(0, 8, 200)
w_grid, b_grid = np.meshgrid(w_values, b_values)
loss_grid = np.zeros_like(w_grid)

# Calculate the loss for each combination of w and b
for i in range(w_grid.shape[0]):
    for j in range(w_grid.shape[1]):
        w, b = w_grid[i, j], b_grid[i, j]
        y_pred = w * x + b
        loss_grid[i, j] = compute_loss(y, y_pred)

# Step 3: Implement gradient descent on w and b
learning_rate = 0.3
n_iterations = 50
w = 2
b = 1

# Store values for plotting the trajectory
trajectory = [(w, b)]
y_preds = [w * gridx + b]

# Precompute the parameter trajectory and predictions for the animation
ws = [w]
bs = [b]
for iteration in range(n_iterations):
    # Predictions
    y_pred = ws[-1] * x + bs[-1]

    # Compute gradients (matching the formulas above, e.g.
    # dL/dw = (2/N) * sum((w*x + b - y) * x) = -2 * mean(x * (y - y_pred)))
    w_gradient = -2 * np.mean(x * (y - y_pred))
    b_gradient = -2 * np.mean(y - y_pred)

    # Update parameters
    ws.append(ws[-1] - learning_rate * w_gradient)
    bs.append(bs[-1] - learning_rate * b_gradient)

    # Save the trajectory and predictions
    trajectory.append((ws[-1], bs[-1]))
    y_preds.append(ws[-1] * gridx + bs[-1])

# Prepare the figure
fig, axs = plt.subplots(1, 2, figsize=(12, 6))

# Left subplot: Loss landscape and trajectory
levels = [0, 0.1, 1, 2, 5, 10, 15, 20, 30, 50, 75, 100]
cp = axs[0].contour(w_grid, b_grid, loss_grid, levels=levels, cmap='viridis')
axs[0].clabel(cp, inline=True, fontsize=10)
traj_plot, = axs[0].plot([], [], 'r-o', label='Gradient Descent Path')
axs[0].set_xlabel('w')
axs[0].set_ylabel('b')
axs[0].set_title('Trajectory of Gradient Descent in the Loss Landscape')
axs[0].legend()

# Right subplot: Current line over noisy data
scatter = axs[1].scatter(x, y, color='blue', label='Noisy Data')
line, = axs[1].plot([], [], 'r-', label='Current Fit')
axs[1].set_xlabel('x')
axs[1].set_ylabel('y')
axs[1].set_title('Current Fit of the Model')
axs[1].legend()

def animate(i):
    # Update trajectory plot
    traj_plot.set_data(*zip(*trajectory[:i+1]))

    # Update line plot
    y_pred = y_preds[i]
    line.set_data(gridx.flatten(), y_pred.flatten())

    return traj_plot, line

# One frame per stored point (n_iterations + 1, including the initial guess)
ani = animation.FuncAnimation(fig, animate, frames=len(trajectory), interval=200, blit=True)

# Display the animation in the notebook; close the figure so only the
# animation (and not an extra static copy of it) is shown
plt.close(fig)
display(HTML(ani.to_jshtml()))
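
As a quick sanity check, the parameters reached by gradient descent can be compared with the closed-form least-squares solution (a minimal sketch reusing `x`, `y`, `ws`, and `bs` from the cell above):

# Closed-form least squares for y ≈ w*x + b: solve [x, 1] @ (w, b) = y
X = np.hstack([x, np.ones_like(x)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)  # coef has shape (2, 1)
w_ls, b_ls = coef[0, 0], coef[1, 0]
print(f"gradient descent: w = {ws[-1]:.3f}, b = {bs[-1]:.3f}")
print(f"closed form:      w = {w_ls:.3f}, b = {b_ls:.3f}")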

More gradient descent algorithms